Aim: To perform differential expression analysis in COPD experiments, contransting COPD vs Control group
We previously had run the following scripts:
0: Data selection /0-Data-selection.RMD
1: Dowload RAW data /1-download_raw-data.RMD
2: Normalizing data /2-normalizing-data.RMD
3: Curating data /3-curatig-data.RMD
This script needs the following files, which were obtained by step 3 /3-curatig-data.RMD:
Table: GSE summary 2020-10-05_GSE_Summary.csv
Data 2: ExpressionSet objects with normalized and curated data 2020-10-05_GSE_LungTissue-CURATED.RDS
For running the script, type:
ssh -X aaltamirano@dna.lavis.unam.mx
qrsh
cd /mnt/Genoma/amedina/DataDNA/R-projects/Meta-analysis_COPD/output_data/.out
module load r/4.0.1
## OR
ssh ana@10.200.0.42
cd /home/ana/DataDNA/R-projects/Meta-analysis_COPD/output_data/.out
nohup R -e "rmarkdown::render(here::here('vignettes/4-DE.RMD'))" > 4-DE.RMD.out &
The script can be found in: /home/ana/DataDNA/R-projects/Meta-analysis_COPD/vignettes. And the directories were set using R/setup.R (e.i. DATA, OUPUT, DOWNLOAD) as a small functions that will paste the names into complete paths.
knitr::opts_knit$set(root.dir = here::here())
source(here::here("R/setup.R"))
And this analysis is run in: /home/ana/DataDNA/R-projects/Meta-analysis_COPD
library(tidyverse)
library(knitr)
library(DESeq2)
library(limma)
library(oligo)
library(vsn)
We selected experiments from lung samples of COPD patients and that also have a control group to compare. These experiments are described in the following table:
gse_table <- read.csv(OUTPUT("2020-10-07_Step3_Summary.csv"), row.names = 1)
kable(gse_table, caption = "GSE information") %>%
kableExtra::scroll_box(width = "100%", height = "100px")
| CONTROL | COPD | COUNTRY | SUBMISSION_DATE | PLATFORM | NUM_GENES | AGE_MEAN | AGE_MEDIAN | AGE_RANGE | SEX_F | SEX_M | SMOKING_STATUS | PACKYEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GSE27597 | 16 | 48 | USA | Mar 01 2011 | GPL5175 | 17325 | 59.62500 | 60.0 | 55-63 | 24 | 40 | NA | 30.00 |
| GSE106986 | 5 | 14 | Germany | Nov 16 2017 | GPL13497 | 21690 | 66.84211 | 69.0 | 44-79 | 7 | 12 | NA | NA |
| GSE37768 | 20 | 18 | Spain | May 04 2012 | GPL570 | 23521 | NA | NA | NA | NA | NA | NA | NA |
| GSE57148 | 91 | 96 | South Korea | Apr 28 2014 | GPL11154 | 25288 | NA | NA | NA | NA | NA | NA | NA |
| GSE47460 | NA | 145 | USA | May 29 2013 | GPL14550 | 15181 | 63.48019 | 65.0 | 23-87 | NA | NA | 1-Current, 2-Ever (>100), 3-Never, 17, 291, 103 | NA |
| GSE38974 | 9 | 23 | USA | Jun 27 2012 | GPL4133 | 19750 | 61.26087 | 62.0 | 44-79 | NA | NA | SMOKER, 32 | 40.00 |
| GSE8581 | 19 | 16 | USA | Jul 12 2007,Jul 13 2007,Jul 17 2007,Jul 20 2007,Jul 24 2007,Jul 25 2007 | GPL570 | 23521 | 65.77193 | 67.0 | 39-84 | NA | NA | NA | |
| GSE1650 | 12 | 18 | USA | Aug 06 2004 | GPL96 | 13516 | NA | NA | NA | NA | NA | NA | NA |
| GSE1122 | 5 | 10 | USA | Mar 09 2004 | GPL80 | 5615 | NA | NA | NA | NA | NA | NA | NA |
| GSE119040 | 4 | 6 | Argentina | Aug 24 2018 | GPL96 | 13516 | NA | NA | NA | NA | NA | NA | NA |
| GSE103174 | 16 | 37 | Spain | Aug 28 2017 | GPL13667 | 15021 | 66.15094 | 67.0 | 48-80 | NA | NA | no, yes, 24, 29 | 40.25 |
| GSE124180 | 15 | 6 | USA | Dec 20 2018 | GPL16791 | 52393 | 62.32857 | 61.3 | 45.3-73.8 | NA | NA | Current, Former, Never, 7, 10, 4 | NA |
Using this funtion, we get a table with differential expression gene results using limma package for fitting a linear model to get genes deferentially expressed between a “Control” and a “COPD” group.
Input: GSE ID, optional: colCOPD is the column name in which the information of disease status can be found, coeff will show results of contrast with coeffitient found in possiton 2
Output: Table of differential expression results with all genes
DE <- function(ExpressionSet,colCOPD="DISEASE",coeff= "DISEASECOPD"){
# it creates the design matrix and performs limma
fit <- lmFit(ExpressionSet, model.matrix(as.formula(paste("~ 1 +", colCOPD)),
data = pData(ExpressionSet)))
# eBayes in lmFit model
ebf <- eBayes(fit)
print(colnames(coef(fit)))
# It gets the genes with the p-values
volcanoplot(ebf,coef = coeff,highlight=20, pch=20)
res <- topTable(ebf, number = Inf, p.value = 1, coef = coeff,confint=T)
# It formats in a tibble
res <- as_tibble(res,rownames="rownames")
}
For microarray experiments, all the following steps are the same, so we will wrap the following code into one function.
Input: Position of the GSE id inside of geo list (e.i. 1,2..)
Output: Table with DE results and a .csv genes
save_DE <- function(i){
g <- geo[[i]][[1]]
message("plotting data")
boxplot(g)
hist(g)
# DE using a local function. DISEASE is a column with disease status information
message("DE analysis")
de <- DE(g,colCOPD = "DISEASE")
# renaming colnames with the GSE id
message("Renaming colnames with GSE id")
colnames(de) <- str_c(colnames(de),"_",names(geo)[i])
colnames(de)
# saving the results in a csv
message("Saving data into csv")
write_csv(de,path= OUTPUT(c("DE/",TODAY,"_Step4_TableGenes_",i,"_",names(geo[i]),".csv")))
# saving the results to later combine all the information
return(de)
}
geo <- readRDS(OUTPUT("2020-10-07_Step3_LungTissue-CURATED.RDS"))
names(geo)
## [1] "GSE27597" "GSE106986" "GSE37768" "GSE57148" "GSE47460" "GSE38974"
## [7] "GSE8581" "GSE1650" "GSE1122" "GSE119040" "GSE103174" "GSE124180"
Using DE() function (described above), we performed a lineal regression model to calculate the logarithm fold change of all genes between a “Control” and a “COPD” group. We also rename colnames adding the GSE ID at the end and finally, we save the output in a .CSV file.
i <- 1
de1 <- save_DE(i)
## plotting data
## DE analysis
## [1] "(Intercept)" "DISEASECOPD"
## Renaming colnames with GSE id
## Saving data into csv
## Warning: The `path` argument of `write_csv()` is deprecated as of readr 1.4.0.
## Please use the `file` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
i <- 2
de1 <- save_DE(i)
## plotting data
## DE analysis
## [1] "(Intercept)" "DISEASECOPD"
## Renaming colnames with GSE id
## Saving data into csv
i <- 3
de3 <- save_DE(i)
## plotting data
## DE analysis
## [1] "(Intercept)" "DISEASECOPD"
## Renaming colnames with GSE id
## Saving data into csv
This experiment is RNAseq, the data will be download from Recount2. I’m following Recount2 vignette Using DESeq2package we calculated DEG.
## Specify design and switch to DESeq2 format
i <- 4
rse <- geo[[i]][[1]]
dds <- DESeqDataSet(rse, ~ DISEASE)
## converting counts to integer mode
## Warning in DESeqDataSet(rse, ~DISEASE): some variables in design formula are
## characters, converting to factors
## Perform DE analysis
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 1439 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
boxplot(log10(assays(dds)[["cooks"]]), range=0, las=2)
ntd <- normTransform(dds)
meanSdPlot(assay(ntd))
res <- results(dds)
plotMA(res, ylim=c(-2,2))
# Calculates de CI
res$error <- qnorm(0.975)*res$lfcSE
res$CI.L <- res$log2FoldChange-res$error
res$CI.R <- res$log2FoldChange+res$error
res
## log2 fold change (MLE): DISEASE COPD vs CONTROL
## Wald test p-value: DISEASE COPD vs CONTROL
## DataFrame with 58037 rows and 9 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003.14 924.46223 -0.1311028 0.0763894 -1.716244 0.0861174
## ENSG00000000005.5 2.85737 -0.4396692 0.4629495 -0.949713 0.3422580
## ENSG00000000419.12 1089.25569 0.0152275 0.0300546 0.506663 0.6123911
## ENSG00000000457.13 698.75730 -0.0656469 0.0405149 -1.620314 0.1051648
## ENSG00000000460.16 347.79272 -0.0502240 0.0465180 -1.079669 0.2802896
## ... ... ... ... ... ...
## ENSG00000283695.1 0.0454355 -0.1779953 1.9047251 -0.0934493 9.25547e-01
## ENSG00000283696.1 37.5449930 0.5120456 0.0933836 5.4832482 4.17586e-08
## ENSG00000283697.1 21.2156969 0.0782234 0.1087300 0.7194274 4.71878e-01
## ENSG00000283698.1 0.1327733 -0.1294404 0.9741244 -0.1328787 8.94289e-01
## ENSG00000283699.1 0.0646975 0.1592290 1.3843041 0.1150246 9.08426e-01
## padj error CI.L CI.R
## <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003.14 0.177940 0.1497204 -0.2808232 0.0186176
## ENSG00000000005.5 0.502138 0.9073642 -1.3470334 0.4676951
## ENSG00000000419.12 0.743547 0.0589059 -0.0436783 0.0741334
## ENSG00000000457.13 0.207758 0.0794078 -0.1450547 0.0137609
## ENSG00000000460.16 0.435330 0.0911736 -0.1413976 0.0409495
## ... ... ... ... ...
## ENSG00000283695.1 NA 3.733193 -3.911188 3.555197
## ENSG00000283696.1 5.80204e-07 0.183029 0.329017 0.695074
## ENSG00000283697.1 6.26366e-01 0.213107 -0.134884 0.291330
## ENSG00000283698.1 NA 1.909249 -2.038689 1.779808
## ENSG00000283699.1 NA 2.713186 -2.553957 2.872415
r <- as_tibble(res, rownames="rownames")
r <- full_join(r,as.data.frame(rowData(geo[[i]][[1]])), by=c("rownames"="gene_id"))
r$logFC <- r$log2FoldChange
colnames(r) <- str_c(colnames(r),"_",names(geo[i]))
colnames(r)
## [1] "rownames_GSE57148" "baseMean_GSE57148"
## [3] "log2FoldChange_GSE57148" "lfcSE_GSE57148"
## [5] "stat_GSE57148" "pvalue_GSE57148"
## [7] "padj_GSE57148" "error_GSE57148"
## [9] "CI.L_GSE57148" "CI.R_GSE57148"
## [11] "bp_length_GSE57148" "symbol_GSE57148"
## [13] "GENE.SYMBOL_GSE57148" "logFC_GSE57148"
r <- r[,-12]
colnames(r)
## [1] "rownames_GSE57148" "baseMean_GSE57148"
## [3] "log2FoldChange_GSE57148" "lfcSE_GSE57148"
## [5] "stat_GSE57148" "pvalue_GSE57148"
## [7] "padj_GSE57148" "error_GSE57148"
## [9] "CI.L_GSE57148" "CI.R_GSE57148"
## [11] "bp_length_GSE57148" "GENE.SYMBOL_GSE57148"
## [13] "logFC_GSE57148"
write_csv(r,path= OUTPUT(c("DE/",TODAY,"_Step4_TableGenes_",i,"_",names(geo[i]),".csv")))
de4 <- r
i <- 5
de5 <- save_DE(i)
## plotting data
## DE analysis
## [1] "(Intercept)" "DISEASECOPD"
## [3] "DISEASEInterstitial lung disease"
## Renaming colnames with GSE id
## Saving data into csv
i <- 6
de6 <- save_DE(i)
## plotting data
## DE analysis
## [1] "(Intercept)" "DISEASECOPD"
## Renaming colnames with GSE id
## Saving data into csv
i <- 7
de7 <- save_DE(i)
## plotting data
## DE analysis
## [1] "(Intercept)" "DISEASECOPD" "DISEASEUnclassified"
## Renaming colnames with GSE id
## Saving data into csv
i <- 8
de8 <- save_DE(i)
## plotting data
## DE analysis
## [1] "(Intercept)" "DISEASECOPD"
## Renaming colnames with GSE id
## Saving data into csv
i <- 9
de9 <- save_DE(i)
## plotting data
## DE analysis
## [1] "(Intercept)" "DISEASECOPD"
## Renaming colnames with GSE id
## Saving data into csv
i <- 10
de10 <- save_DE(i)
## plotting data
## DE analysis
## [1] "(Intercept)" "DISEASECOPD"
## Renaming colnames with GSE id
## Saving data into csv
i <- 11
de11 <- save_DE(i)
## plotting data
## DE analysis
## [1] "(Intercept)" "DISEASECOPD"
## Renaming colnames with GSE id
## Saving data into csv
i <- 12
rse <- geo[[i]][[1]]
counts <- t(apply(exprs(rse), 1,as.numeric))
colnames(counts) <- colnames(rse)
colData <- pData(rse)
rse <- SummarizedExperiment(assays=list(counts=counts), colData=colData)
rowData(rse) <- fData(geo[[i]][[1]])
dds <- DESeqDataSet(rse, ~ DISEASE)
## converting counts to integer mode
## Warning in DESeqDataSet(rse, ~DISEASE): some variables in design formula are
## characters, converting to factors
## Perform DE analysis
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 123 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
boxplot(log10(assays(dds)[["cooks"]]), range=0, las=2)
ntd <- normTransform(dds)
meanSdPlot(assay(ntd))
res <- results(dds)
plotMA(res, ylim=c(-2,2))
# Calculates de CI
res$error <- qnorm(0.975)*res$lfcSE
res$CI.L <- res$log2FoldChange-res$error
res$CI.R <- res$log2FoldChange+res$error
res
## log2 fold change (MLE): DISEASE COPD vs CONTROL
## Wald test p-value: DISEASE COPD vs CONTROL
## DataFrame with 54140 rows and 9 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000419 419.9005 0.1982886 0.158400 1.251821 0.210635
## ENSG00000000457 433.5534 -0.0283818 0.105238 -0.269691 0.787398
## ENSG00000000460 108.3336 -0.1826590 0.181100 -1.008609 0.313162
## ENSG00000000938 40.8155 0.5006423 0.492299 1.016947 NA
## ENSG00000000971 2305.8074 -0.0889485 0.186240 -0.477600 0.632935
## ... ... ... ... ... ...
## ENSG00000281904 0.283249 -0.441699 2.912079 -0.151678 0.8794407
## ENSG00000281909 0.360405 -0.580817 1.627405 -0.356898 0.7211683
## ENSG00000281910 0.000000 NA NA NA NA
## ENSG00000281912 3.582084 -0.352601 0.477911 -0.737796 0.4606383
## ENSG00000281920 1.303847 -2.045900 1.059108 -1.931720 0.0533941
## padj error CI.L CI.R
## <numeric> <numeric> <numeric> <numeric>
## ENSG00000000419 0.934453 0.310459 -0.112170 0.508747
## ENSG00000000457 0.981989 0.206263 -0.234645 0.177881
## ENSG00000000460 0.952026 0.354949 -0.537608 0.172290
## ENSG00000000938 NA 0.964889 -0.464247 1.465531
## ENSG00000000971 0.967615 0.365024 -0.453973 0.276076
## ... ... ... ... ...
## ENSG00000281904 NA 5.707570 -6.14927 5.2658709
## ENSG00000281909 NA 3.189655 -3.77047 2.6088379
## ENSG00000281910 NA NA NA NA
## ENSG00000281912 0.961404 0.936689 -1.28929 0.5840877
## ENSG00000281920 NA 2.075814 -4.12171 0.0299139
r <- as_tibble(res, rownames="rownames")
r <- full_join(r,as.data.frame(rowData(rse)), by=c("rownames"="GENEID"))
r$logFC <- r$log2FoldChange
colnames(r) <- str_c(colnames(r),"_",names(geo[i]))
colnames(r)
## [1] "rownames_GSE124180" "baseMean_GSE124180"
## [3] "log2FoldChange_GSE124180" "lfcSE_GSE124180"
## [5] "stat_GSE124180" "pvalue_GSE124180"
## [7] "padj_GSE124180" "error_GSE124180"
## [9] "CI.L_GSE124180" "CI.R_GSE124180"
## [11] "GENENAME_GSE124180" "SEQNAME_GSE124180"
## [13] "GENESEQSTART_GSE124180" "GENESEQEND_GSE124180"
## [15] "GENE.SYMBOL_GSE124180" "logFC_GSE124180"
write_csv(r,path= OUTPUT(c("DE/",TODAY,"_Step4_TableGenes_",i,"_",names(geo[i]),".csv")))
# saving the results to later combine all the information
de12 <- r
This script produces the following data, and can be found in /home/ana/DataDNA/R-projects/Meta-analysis_COPD
Tables with DE results: Tables with log fold change and p-values calculated per experiment and saved in csv files
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=it_IT.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=it_IT.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=it_IT.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] vsn_3.56.0 oligo_1.52.1
## [3] Biostrings_2.56.0 XVector_0.28.0
## [5] oligoClasses_1.50.4 limma_3.44.3
## [7] DESeq2_1.28.1 SummarizedExperiment_1.18.2
## [9] DelayedArray_0.14.1 matrixStats_0.57.0
## [11] Biobase_2.48.0 GenomicRanges_1.40.0
## [13] GenomeInfoDb_1.24.2 IRanges_2.22.2
## [15] S4Vectors_0.26.1 BiocGenerics_0.34.0
## [17] knitr_1.30 forcats_0.5.0
## [19] dplyr_1.0.2 purrr_0.3.4
## [21] readr_1.4.0 tidyr_1.1.2
## [23] tibble_3.0.3 ggplot2_3.3.2
## [25] tidyverse_1.3.0 stringr_1.4.0
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.4-1 ellipsis_0.3.1 rprojroot_1.3-2
## [4] fs_1.5.0 rstudioapi_0.11 hexbin_1.28.1
## [7] farver_2.0.3 affyio_1.58.0 bit64_4.0.5
## [10] AnnotationDbi_1.50.3 fansi_0.4.1 lubridate_1.7.9
## [13] xml2_1.3.2 codetools_0.2-16 splines_4.0.2
## [16] geneplotter_1.66.0 jsonlite_1.7.1 broom_0.7.1
## [19] annotate_1.66.0 dbplyr_1.4.4 BiocManager_1.30.10
## [22] compiler_4.0.2 httr_1.4.2 backports_1.1.10
## [25] assertthat_0.2.1 Matrix_1.2-18 cli_2.0.2
## [28] htmltools_0.5.0 tools_4.0.2 gtable_0.3.0
## [31] glue_1.4.2 GenomeInfoDbData_1.2.3 affy_1.66.0
## [34] affxparser_1.60.0 Rcpp_1.0.5 cellranger_1.1.0
## [37] vctrs_0.3.4 preprocessCore_1.50.0 iterators_1.0.12
## [40] xfun_0.18 rvest_0.3.6 lifecycle_0.2.0
## [43] XML_3.99-0.5 zlibbioc_1.34.0 scales_1.1.1
## [46] BiocStyle_2.16.1 hms_0.5.3 RColorBrewer_1.1-2
## [49] yaml_2.2.1 memoise_1.1.0 stringi_1.5.3
## [52] RSQLite_2.2.1 highr_0.8 genefilter_1.70.0
## [55] foreach_1.5.0 BiocParallel_1.22.0 rlang_0.4.7
## [58] pkgconfig_2.0.3 bitops_1.0-6 evaluate_0.14
## [61] lattice_0.20-41 labeling_0.3 bit_4.0.4
## [64] tidyselect_1.1.0 here_0.1 magrittr_1.5
## [67] R6_2.4.1 generics_0.0.2 DBI_1.1.0
## [70] pillar_1.4.6 haven_2.3.1 withr_2.3.0
## [73] survival_3.2-3 RCurl_1.98-1.2 modelr_0.1.8
## [76] crayon_1.3.4 rmarkdown_2.3 locfit_1.5-9.4
## [79] grid_4.0.2 readxl_1.3.1 blob_1.2.1
## [82] webshot_0.5.2 reprex_0.3.0 digest_0.6.25
## [85] xtable_1.8-4 ff_4.0.2 munsell_0.5.0
## [88] viridisLite_0.3.0 kableExtra_1.2.1